Abstract

In this paper, we propose an implicit neural network architecture and show that it can be computed in a reasonably efficient manner. Our architecture relaxes the causality assumption in formulating recurrent neural networks, so that the hidden states of the network are coupled together, in order to improve performance on complex, long-range dependencies in either direction of a sequence. We contrast our architecture with a bidirectional RNN, and show that our proposed architecture matches its performance on one task, while providing an ensembling benefit greater than ensembling multiple bidirectional networks.

1  Introduction 

Feedforward neural networks were designed to approximate and interpolate functions. Recurrent networks were developed to predict sequences. These recurrent networks can be ‘unwrapped’ and thought of as a very deep feedforward network, with each layer sharing the same set of weights. Computation proceeds one step at a time, like the trajectory of an ordinary differential equation when solving an initial value problem. However, in certain applications in natural language processing, especially those with long-distance effects and where grammar matters, sequence prediction may be better thought of as a boundary value problem, because information flows inward from the boundary and creates a strongly coupled system. The bidirectional network addresses this problem by allowing information to flow in both directions. However, this still equates to running two ‘for’ loops through the data. Many algorithms in practice require more than two passes through the data to determine the answer. We attempt to provide a slightly different mechanism from the bidirectional network, where our motivation is a program which iterates over itself until convergence.

Figure 1: Traditional recurrent neural network structure.

1.1  Related  Work 

Long-range dependencies have been an issue as long as there have been NLP tasks, and there are many effective approaches to dealing with them. In the context of Hidden Markov models (HMMs), there are the “Forward-Backward” models. In information extraction, there are non-local sequence models that use Gibbs sampling (Finkel et al., 2005), which is the paper most similar to this work. In recent years, the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and variants such as the Gated Recurrent Unit (GRU) (Cho et al., 2014) have soared in popularity. These enabled recurrent neural networks to process long sequences without the problem of vanishing or exploding gradients, and thus to retain information about dependencies. The Bidirectional LSTM (b-LSTM) (Graves and Schmidhuber, 2005), a natural extension of (Schuster and Paliwal, 1997), incorporates past and future hidden states via two separate recurrent networks, allowing information/gradients to flow in two directions.

2  The Implicit Recurrent Neural Network

2.1  Assumptions  of  Recurrent  Neural  Networks 

A typical recurrent neural network has an input sequence $[x_1, x_2, \ldots, x_s]$ and initial state $h_0$, and iteratively produces future states:

$$
\begin{aligned}
h_1 &= f(x_1, h_0) \\
h_2 &= f(x_2, h_1) \\
&\;\;\vdots \\
h_s &= f(x_s, h_{s-1})
\end{aligned}
$$


The LSTM, GRU, and related variants follow this formula, with different choices for the state transition function. Computation proceeds left-to-right, with each next state depending only on inputs and previously computed hidden states. In this work, we investigate relaxing this assumption by allowing $h_t = f(x_t, h_{t-1}, h_{t+1})$.² This leads to an implicit set of equations for the entire sequence of hidden states, which can be thought of as a single tensor $H$:

$$H = [h_0, h_1, h_2, \ldots, h_s]$$

This yields a system of nonlinear equations. This setup has the potential for arriving at nonlocal, whole-sequence dependent results. As an additional motivation, we may also wonder if such a system is more ‘stable’, whereby the predicted sequence may drift less from the true meaning, since errors will not compound with each time step in the same way.

² A wider stencil can also be used, e.g. $f(h_{t-2}, h_{t-1}, \ldots)$.

2.2  Proposed Architecture

There are many potential ways to architect a neural network - in fact, this flexibility is one of deep learning’s best features - but we restrict our discussion to the one depicted in Figure 2. In this setup, we have the following variables:

data               $X$
labels             $Y$
parameters         $\theta$
transformed input  $\xi$
hidden layers      $H$

and functions:

loss function                     $L = \ell(\theta, H, Y)$
implicit hidden layer definition  $H - F(\theta, \xi, H) = 0$
input layer transformation        $\xi = g(\theta, X)$

Our implicit definition function, $F$, is made up of local state transitions, and forms a system of nonlinear equations that we need to solve:

$$
\begin{aligned}
h_1 &= f(h_0, h_2, \xi_1) \\
&\;\;\vdots \\
h_i &= f(h_{i-1}, h_{i+1}, \xi_i) \\
&\;\;\vdots \\
h_n &= f(h_{n-1}, h_{n+1}, \xi_n)
\end{aligned}
$$

Figure 2: Proposed implicit network architecture

2.3  Computing the forward pass

To evaluate the network, we must solve the equation $H = F(H)$. We computed this via an approximate Newton solve:

$$H_{n+1} = H_n - (I - \nabla_H F)^{-1} (H_n - F(H_n))$$

Since $(I - \nabla_H F)$ is sparse, we apply Krylov subspace methods (Knoll and Keyes, 2004), specifically the BiCG-Stab method (Van der Vorst, 1992), since the system is non-symmetric. This has the added advantage of relying only on matrix-vector multiplies of the gradient of $F$, which can be conveniently and efficiently computed via the Theano (Bergstra et al., 2011) operator Rop.

We also considered approximating the inverse via a polynomial $P_n(\nabla_H F)$, and used the geometric series $P_n(x) = 1 + x + x^2 + \ldots + x^n$, which converges provided that $\|\nabla_H F\| < 1$. This, with $n = 1$, proved a reasonable approximation for the task of Part-of-Speech tagging (Section 3.1).

However, for other tasks, we found the eigenvalues of the system were not as well-behaved.

2.4  Gradients

In order to train the model, we perform gradient descent. Taking the gradient of the loss function:

$$\nabla_\theta L = \nabla_\theta \ell + \nabla_H \ell \, \nabla_\theta H$$

so we will need to know the gradient of the hidden units with respect to the parameters, which we can find via the implicit definition:

$$
\begin{aligned}
0 &= \nabla_\theta H - \nabla_\theta F - \nabla_H F \, \nabla_\theta H - \nabla_\xi F \, \nabla_\theta \xi \\
(I - \nabla_H F) \, \nabla_\theta H &= \nabla_\theta F + \nabla_\xi F \, \nabla_\theta \xi \\
\nabla_\theta H &= (I - \nabla_H F)^{-1} (\nabla_\theta F + \nabla_\xi F \, \nabla_\theta \xi)
\end{aligned}
$$

The entire gradient is then

$$\nabla_\theta L = \nabla_\theta \ell + \nabla_H \ell \, (I - \nabla_H F)^{-1} (\nabla_\theta F + \nabla_\xi F \, \nabla_\theta \xi)$$


Once again, the inverse of $I - \nabla_H F$ appears, and we can compute it via Krylov subspace methods. Since it is more efficient to compute $\nabla_H \ell \, (I - \nabla_H F)^{-1}$ first, our solver in this case uses the Theano Lop operator.
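The forward solve and the implicit gradient can be illustrated on a small toy problem. The sketch below is not the Theano implementation used in this work: it assumes scalar hidden states, zero boundary states, a hypothetical transition $f = \tanh(a\,h_{i-1} + b\,h_{i+1} + \xi_i)$ with $\theta = (a, b)$, an input transformation that does not depend on $\theta$ (so the $\nabla_\xi F \, \nabla_\theta \xi$ term vanishes), and a squared-error loss. A dense `np.linalg.solve` stands in for BiCG-Stab, and hand-written Jacobians stand in for Rop and Lop. All function names are illustrative.

```python
import numpy as np

def padded(H):
    """Attach zero boundary states h_0 and h_{n+1} (an assumption of this toy)."""
    return np.concatenate(([0.0], H, [0.0]))

def F(theta, xi, H):
    """Local transitions F_i = tanh(a * h_{i-1} + b * h_{i+1} + xi_i)."""
    a, b = theta
    Hp = padded(H)
    return np.tanh(a * Hp[:-2] + b * Hp[2:] + xi)

def jacobians(theta, xi, H):
    """Return (dF/dH, dF/dtheta) evaluated at H; dF/dH is tridiagonal."""
    a, b = theta
    Hp = padded(H)
    d = 1.0 - F(theta, xi, H) ** 2                    # tanh'(pre-activation)
    J_H = np.diag(a * d[1:], k=-1) + np.diag(b * d[:-1], k=1)
    J_theta = np.stack([d * Hp[:-2], d * Hp[2:]], axis=1)
    return J_H, J_theta

def solve_forward(theta, xi, tol=1e-12, max_iter=50):
    """Newton iteration H <- H - (I - dF/dH)^{-1} (H - F(H)), started at H = 0."""
    H = np.zeros_like(xi)
    I = np.eye(len(xi))
    for _ in range(max_iter):
        residual = H - F(theta, xi, H)
        if np.linalg.norm(residual) < tol:
            break
        J_H, _ = jacobians(theta, xi, H)
        H = H - np.linalg.solve(I - J_H, residual)    # dense stand-in for BiCG-Stab
    return H

def neumann_apply(J, r, n):
    """Approximate (I - J)^{-1} r by the truncated series (I + J + ... + J^n) r."""
    term, total = r.copy(), r.copy()
    for _ in range(n):
        term = J @ term
        total = total + term
    return total

def loss(theta, xi, Y):
    """Squared-error loss evaluated at the solved hidden states."""
    H = solve_forward(theta, xi)
    return 0.5 * np.sum((H - Y) ** 2)

def implicit_gradient(theta, xi, Y):
    """dL/dtheta = dl/dH (I - dF/dH)^{-1} dF/dtheta, forming the left product first."""
    H = solve_forward(theta, xi)
    J_H, J_theta = jacobians(theta, xi, H)
    grad_H = H - Y                                    # dl/dH for the squared error
    # Transposed solve gives v^T = grad_H (I - dF/dH)^{-1}, the Lop-style ordering.
    v = np.linalg.solve((np.eye(len(xi)) - J_H).T, grad_H)
    return v @ J_theta

theta = np.array([0.4, 0.3])
xi = np.array([0.5, -0.2, 0.1, 0.7, -0.4])
Y = np.zeros(5)
eps = 1e-6
finite_diff = np.array([
    (loss(theta + eps * e, xi, Y) - loss(theta - eps * e, xi, Y)) / (2 * eps)
    for e in np.eye(2)
])
print("implicit gradient:", implicit_gradient(theta, xi, Y))
print("finite differences:", finite_diff)
```

Solving the transposed system once costs a single linear solve regardless of the number of parameters, which is the reason for computing the left product first. The `neumann_apply` helper shows the geometric-series alternative; it is only a valid approximation when the norm of the Jacobian is below one, as it is for the small coupling weights chosen here.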


2.5  Transition  Functions 




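Recall the original GRU equations, with slight notational modifications:

$$
\begin{aligned}
\text{final hidden:} \quad & h_t = (1 - z_t) \odot \hat{h}_t + z_t \odot \tilde{h}_t \\
\text{candidate hidden:} \quad & \tilde{h}_t = \tanh(W x_t + U (r_t \odot \hat{h}_t) + b) \\
\text{update weight:} \quad & z_t = \sigma(W_z x_t + U_z \hat{h}_t + b_z) \\
\text{reset gate:} \quad & r_t = \sigma(W_r x_t + U_r \hat{h}_t + b_r)
\end{aligned}
$$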
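As a concrete reading of these four equations, the following NumPy sketch evaluates one transition. It assumes that the update weight multiplies the candidate state and that the incoming state (written `h_hat` here) enters both gates; the parameter dictionary, its keys, and `init_params` are hypothetical names introduced only for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic function used by both gates."""
    return 1.0 / (1.0 + np.exp(-a))

def gru_transition(x_t, h_hat, params):
    """One GRU-style transition from input x_t and incoming state h_hat."""
    z = sigmoid(params["Wz"] @ x_t + params["Uz"] @ h_hat + params["bz"])  # update weight
    r = sigmoid(params["Wr"] @ x_t + params["Ur"] @ h_hat + params["br"])  # reset gate
    h_tilde = np.tanh(params["W"] @ x_t + params["U"] @ (r * h_hat) + params["b"])  # candidate
    return (1.0 - z) * h_hat + z * h_tilde                                 # final hidden

def init_params(input_size, hidden_size, seed=0, scale=0.1):
    """Small random weights and zero biases (illustrative initialisation only)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("", "z", "r"):
        params["W" + gate] = scale * rng.standard_normal((hidden_size, input_size))
        params["U" + gate] = scale * rng.standard_normal((hidden_size, hidden_size))
        params["b" + gate] = np.zeros(hidden_size)
    return params

params = init_params(input_size=3, hidden_size=4)
h_t = gru_transition(np.ones(3), np.zeros(4), params)
print(h_t.shape)
```

In an ordinary recurrent network `h_hat` is simply the previous hidden state; the implicit variant changes only what is passed in as `h_hat`, leaving the transition itself untouched.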


To make this implicit and bidirectional, we let $\hat{h}_t$ (defined as $h_{t-1}$ in (Cho et al., 2014)) be a linear combination of previous and future hidden states via the switch variable $s$. The switch $s$ is in turn determined by a competition between two sigmoidal units $s_p$ and $s_n$, representing the contributions of the previous and next hidden states, respectively.

state  combination 
switch 

previous  switch 
next  switch 

3  Experiments 

3.1  Part-of-speech  tagging 

Next we apply our model to a real world problem. Part-of-speech tagging fits naturally in the sequence labeling framework, and has the advantage of a standard dataset that we can use to compare our network with other techniques. To train a part-of-speech tagger, we simply let $L$ be a softmax layer transforming each hidden unit output into a part-of-speech tag. Initially, $\xi$ consisted only of word vectors for 39,000 case-sensitive vocabulary words. Next, we lowercased the vocabulary words, but added a single feature indicating whether case appeared in the data. Third, we added six additional ‘word vector’ components to encode the top-2000 most common prefixes and suffixes of words, for affix lengths 2 to 4. Finally, we added in other (binary) features to indicate the presence of numbers, symbols, punctuation, and richer case data, as used by (Huang et al., 2015).
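The affix features can be sketched as a lookup into six small vocabularies, one per prefix or suffix length. The code below is an illustration under assumptions that the text does not settle: that the top-2000 cutoff is applied separately to each of the six slots, that an affix is only taken from words strictly longer than the affix, and that id 0 is reserved for rare or missing affixes. The function names are hypothetical.

```python
from collections import Counter

AFFIX_LENGTHS = (2, 3, 4)

def affixes(word):
    """Return six slots: prefixes of length 2-4, then suffixes of length 2-4.

    A slot is None when the lowercased word is not longer than the affix.
    """
    w = word.lower()
    prefixes = [w[:n] if len(w) > n else None for n in AFFIX_LENGTHS]
    suffixes = [w[-n:] if len(w) > n else None for n in AFFIX_LENGTHS]
    return prefixes + suffixes

def build_affix_vocabs(words, top_k=2000):
    """Count affixes per slot and keep the top_k most common; id 0 means rare/missing."""
    counters = [Counter() for _ in range(2 * len(AFFIX_LENGTHS))]
    for word in words:
        for counter, affix in zip(counters, affixes(word)):
            if affix is not None:
                counter[affix] += 1
    return [
        {affix: i + 1 for i, (affix, _) in enumerate(counter.most_common(top_k))}
        for counter in counters
    ]

def affix_ids(word, vocabs):
    """Map a word to six integer ids, one per affix slot."""
    return [vocab.get(affix, 0) for vocab, affix in zip(vocabs, affixes(word))]

training_words = ["running", "jumping", "Tokyo", "market", "of"]
vocabs = build_affix_vocabs(training_words)
print(affix_ids("singing", vocabs))
```

Each of the six ids would then index its own embedding table, giving the six additional components that are concatenated with the word vector.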

We trained the Part of Speech (POS) tagger on the Penn Treebank Wall Street Journal corpus (Marcus et al., 1993), blocks 0-18, validated on 19-21, and tested on 22-24, and compared it to the results of the off-the-shelf Stanford Part-of-Speech tagger. The results are indicated in Table 1. We were able to achieve comparable results, and as Manning notes, performance gains past that point are quite difficult, due to errors/inconsistencies in the dataset, ambiguity, and very difficult linguistics, sometimes with dependencies across sentences (Manning, 2011).

Figure 3: Visualization of the switch variable. Positive values indicate a right-to-left flow of information, while negative values indicate left-to-right. Note that ‘Tokyo’ is used to modify ‘market’, instead of being a noun, and thus needs information from ‘market’ to make the correct determination.

Training was done using stochastic gradient descent, with an initial learning rate of 0.5 and a batch size of 20. Word vectors were of dimension 200; prefix and suffix vectors were of dimension 12. Hidden unit size was equal to feature input size, so in this case, 280. Training this way takes about 5 seconds per batch. Batching the nonlinear solver was slightly tricky - it was straightforward to perform the same BiCG-Stab computations across different elements in the batch, but some elements converged more quickly than others, and some elements required restarting from a different random initialization. For part-of-speech tagging, however, most of the elements were well-behaved.
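One way such per-element bookkeeping might look is sketched below. This is an assumption about the general pattern, not a description of the solver used in this work: `update` is a hypothetical placeholder for one step of whatever batched solver is in use (here it is exercised with a plain fixed-point step rather than a BiCG-Stab-based Newton step), converged rows are frozen, and rows that fail to converge within `max_iter` steps are restarted from a fresh random initialization.

```python
import numpy as np

def solve_batch(update, H0, tol=1e-10, max_iter=100, max_restarts=2, seed=0):
    """Iterate H <- update(H) row by row until each row changes by less than tol.

    H0 has shape (batch, n). Returns (H, done), where done marks converged rows.
    """
    rng = np.random.default_rng(seed)
    H = np.array(H0, dtype=float)
    done = np.zeros(H.shape[0], dtype=bool)
    for attempt in range(max_restarts + 1):
        if attempt > 0:
            # Restart only the rows that have not converged yet.
            H[~done] = rng.standard_normal(H[~done].shape)
        for _ in range(max_iter):
            H_new = update(H)
            change = np.linalg.norm(H_new - H, axis=1)
            H = np.where(done[:, None], H, H_new)   # frozen rows keep their value
            done |= change < tol
            if done.all():
                return H, done
    return H, done

X = np.array([[0.5, -0.2, 0.1], [1.0, 0.0, -1.0]])
H, done = solve_batch(lambda H: np.tanh(0.5 * H + X), np.zeros_like(X))
print(done)
```

Freezing converged rows keeps the batch shape fixed, so the same vectorized step can be applied to every element while easy elements stop changing early.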

We also visualized some of the outputs of the “switch” variables for various sentences. The switch is made up of many features, so it does not necessarily always correspond to human judgment, but by taking the average, one can get a sense of the flow of information. In Figure 3, we see a visualization of the switch on a very simple sentence, and in Figure 4 we see it in action over a more complicated sentence. Interestingly, phrasal structures emerge.

Figure 4: A more complicated example that is actually more representative of the types of sentences within the WSJ corpus. Note that the entire clause about ‘Health Care Property Investors Inc.’ propagates information from the right - without the ‘Inc.’, the model may not realize it is the name of a company. Also of note are the phrases “offering of 200,000 shares” and “Merrill Lynch Capital Markets.” This graphic was done using the case-insensitive model.


Tagger                   WSJ Accuracy
Word vectors only        0.9626
Single case feature      0.9650
Ensemble of above (2)    0.9683
Affix word-vectors       0.9714
Case+Symbol feats        0.9730
Ensemble without case    0.9731
Ensemble with case       0.9736
Stanford POS Tagger      0.9732

Table 1: Tagging performance.

Architecture         WSJ Accuracy (%)
GRU                  96.51
LSTM                 96.53
Bidirectional GRU    97.26
b-LSTM               97.27
Implicit             97.30
b-LSTM Ensemble      97.31
Implicit Ensemble    97.36
Implicit + b-LSTM    97.40

Table 2: Tagging performance relative to other recurrent architectures.